Abstract

We exhibit the construction of a deterministic automaton that, given $k > 0$, recognizes the (regular) language of $k$-differentiable words. Our approach follows a scheme of Crochemore et al. based on minimal forbidden words. We extend this construction to the case of $C^\infty$-words, i.e., words differentiable arbitrarily many times. We thus obtain an infinite automaton representing the set of $C^\infty$-words, and we derive a classification of $C^\infty$-words induced by the structure of the automaton. Then, we introduce a new framework for dealing with $C^\infty$-words, based on a three-letter alphabet. This allows us to define a compacted version of the automaton, which we use to prove that every $C^\infty$-word admits a repetition whose length is polynomially bounded.
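
1. Introduction

In 1965, W. Kolakoski introduced an infinite word $\mathcal{K}$ over the alphabet {1, 2} having the curious property that the word coincides with its run-length encoding [14]:

$$\mathcal{K} = \underbrace{22}_{2}\,\underbrace{11}_{2}\,\underbrace{2}_{1}\,\underbrace{1}_{1}\,\underbrace{22}_{2}\,\underbrace{1}_{1}\,\underbrace{22}_{2}\,\underbrace{11}_{2}\,\underbrace{2}_{1}\,\underbrace{11}_{2}\,\underbrace{22}_{2}\,\underbrace{1}_{1}\,\underbrace{2}_{1}\,\underbrace{11}_{2}\,\underbrace{2}_{1}\,\underbrace{1}_{1}\cdots$$

Indeed, it is easy to see that the run-length encoding operator has only two fixed points over the alphabet {1,2}, namely the right-infinite words $\mathcal{K}$ and $1\mathcal{K}$.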

Kimberling [13] asked whether the Kolakoski word is recurrent (every factor appears infinitely often) and whether the set of its factors is closed under complement (swapping of 1's and 2's). Dekking [10] observed that the latter condition implies the former, and introduced an operator on finite words, called the derivative, that consists in discarding the first and/or the last run if these have length 1 and then applying the run-length encoding. The derivative is defined for those words over the alphabet {1,2} whose run-length encoding is still a word over the same alphabet; these are called differentiable words.

The set of words that are differentiable arbitrarily many times, called the set of $C^\infty$-words, is closed under complement and reversal, and contains the set of factors of the Kolakoski word. Therefore, one of the most important open problems about the Kolakoski word is to decide whether the double inclusion holds, i.e., to decide whether all the $C^\infty$-words appear as factors in the Kolakoski word.

Actually, the set of $C^\infty$-words contains the set of factors of any right-infinite word $W$ over the alphabet {1,2} having the property that an arbitrary number of applications of the run-length encoding to $W$ still produces a word over the alphabet {1,2}. Such words are called smooth (right-infinite) words [3]. Nevertheless, the existence of a smooth word containing all the $C^\infty$-words as factors is still an open question.

Although C°°-words have been investigated in several relevant papers [21451 ITTl IT5I ITS] , their properties 
are still not well known. Compared with other famous classes of finite words, e.g. Sturmian words, few combinatorial properties of $C^\infty$-words have been established. Weakley [16] started a classification of $C^\infty$-words and obtained significant results on their complexity function. Carpi [5] proved that the set of $C^\infty$-words contains only a finite number of squares, and does not contain cubes (see also [15] and [5]). This result generalizes to repetitions with gap, i.e., to the $C^\infty$-words of the form $uzu$, for a non-empty $z$. Indeed, Carpi proved [5] that for every $k > 0$, only finitely many $C^\infty$-words of the form $uzu$ exist with $z$ not longer than $k$. Recently, Carpi and D'Alonzo [5] introduced the repetitivity index, which is the function that gives, for every non-negative integer $n$, the minimal distance between two occurrences of a word $u$ of length $n$. They proved that the repetitivity index for $C^\infty$-words is ultimately bounded from below by a linear function.

This leads us to address the following problem: 

Problem 1. Let $u, v$ be two $C^\infty$-words. Does there exist a $C^\infty$-word of the form $uzv$?

A positive answer to Problem 1 would dramatically improve the knowledge of the properties of $C^\infty$-words. For example, it would imply that for any $n > 0$, there exists a $C^\infty$-word containing as factors all the $C^\infty$-words of length $n$.

In this paper, we develop a novel approach to the study of $C^\infty$-words. The culminating point of this approach is an infinite graph (in fact, the graph of an infinite automaton) $\mathcal{VUCA}_\infty$ representing the classes of $C^\infty$-words with respect to an equivalence relation based on the extendability of these words. In particular, this allows us to prove that Problem 1 has a positive answer in the case $u = v$. We believe that the new techniques introduced in this paper can give further insights into $C^\infty$-words, and hope that further developments can eventually lead to a (positive) solution of Problem 1 in its general form.

We use a construction of Crochemore et al. [9] for building a (deterministic finite state) automaton recognizing the language $L(M)$ of words avoiding a given anti-factorial set of words $M$. This procedure is called L-AUTOMATON. It takes as input a trie (tree-like automaton) recognizing $M$ and builds a deterministic finite state automaton recognizing $L(M)$. If $M$ is chosen to be the set $\mathcal{MF}(C^k)$ of minimal forbidden words for the set $C^k$ of $k$-differentiable words, the procedure builds an automaton $\mathcal{A}_k$ recognizing the (regular) language $C^k$. Recall that minimal forbidden words are words of minimal length that are not in the set of factors of a given language (see for example pQ). We show how to compute the set $\mathcal{MF}(C^{k+1})$ from the set $\mathcal{MF}(C^k)$. This leads to an effective construction of the trie of $\mathcal{MF}(C^k)$ for any $k > 0$, which is then used as the input of the L-AUTOMATON procedure for the construction of the automaton $\mathcal{A}_k$.

In the case $k = \infty$, the procedure above leads to the definition of an infinite automaton $\mathcal{A}_\infty$ recognizing the set of $C^\infty$-words, in the sense that any $C^\infty$-word is the label of a unique path in $\mathcal{A}_\infty$ starting at the initial state. This automaton induces a natural equivalence on the set of $C^\infty$-words, whose classes are the sets of words corresponding to those paths in $\mathcal{A}_\infty$ starting at the initial state and ending in the same state. We show that this equivalence is deeply related to the properties of simple extendability to the left of $C^\infty$-words. A $C^\infty$-word $w$ is left simply extendable (cf. [16]) if exactly one of $1w$ and $2w$ is a $C^\infty$-word (recall that for any $C^\infty$-word $w$, at least one of $1w$ and $2w$ is a $C^\infty$-word).

In a second step, we use a standard procedure for compacting automata to define the compacted automaton $\mathcal{CA}_\infty$. This latter automaton induces a new equivalence on the set of $C^\infty$-words, which is related to the properties of simple extendability of $C^\infty$-words both on the left and on the right.

We then introduce a new framework for representing $C^\infty$-words on a three-letter alphabet. We show that every $C^\infty$-word is uniquely determined by a pair of suitable sequences over the alphabet {0,1,2}, called the vertical representation of the $C^\infty$-word. This allows us to rewrite the automaton $\mathcal{CA}_\infty$ using this new representation. We therefore obtain the vertical compacted automaton $\mathcal{VCA}_\infty$. This latter automaton can itself be further compacted, leading to the definition of the vertical ultra-compacted automaton $\mathcal{VUCA}_\infty$. All these automata reveal interesting properties and are deeply related to the combinatorial structure of $C^\infty$-words. In particular, using the properties of the automaton $\mathcal{VUCA}_\infty$, we are able to prove, in Theorem 6.2, that for every $C^\infty$-word $u$, there exists a word $z$ such that $uzu$ is a $C^\infty$-word and $|uzu| < C|u|^{2.72}$, for a suitable constant $C$. Indeed, this proves that every $C^\infty$-word admits a repetition with gap whose length is bounded by a sub-cubic function. This result is dual to the previously mentioned result of Carpi on the lower bound of a repetition with gap. Theorem 6.2 also solves Problem 1 in the particular case $u = v$.

The paper is organized as follows. In Section 2 we fix the notation and recall the basic theory of minimal forbidden words; we then recall the procedure L-AUTOMATON. In Section 3 we deal with differentiable words and $C^\infty$-words. In Section 4 we describe the construction of the automata $\mathcal{A}_k$, $\mathcal{A}_\infty$ and $\mathcal{CA}_\infty$ and study their properties. In Section 5 we introduce the vertical representation of a $C^\infty$-word; we then describe the automata $\mathcal{VCA}_\infty$ and $\mathcal{VUCA}_\infty$. In Section 6 we prove that every $C^\infty$-word admits a repetition having length bounded by a sub-cubic function. Finally, in Section 7 we discuss final considerations and future work.



2. Notation and background 

We assume that the reader is familiar with the basic concepts and definitions of classical automata and formal language theory.

Let $\Sigma = \{1, 2\}$. A word over $\Sigma$ is a finite sequence of symbols from $\Sigma$. The length of a word $w$ is denoted by $|w|$. The empty word has length zero and is denoted by $\varepsilon$. The number of occurrences of the letter $x$ in the word $w$ is denoted by $|w|_x$. We denote by $w[i]$ the $(i+1)$-th symbol of a word $w$; so, we write a word $w$ of length $n$ as $w = w[0]w[1]\cdots w[n-1]$. The set of all words over $\Sigma$ is denoted by $\Sigma^*$. The set of all words over $\Sigma$ having length $n$ is denoted by $\Sigma^n$. The set of all words over $\Sigma$ having length not greater than $n$ (resp. not smaller than $n$) is denoted by $\Sigma^{\le n}$ (resp. by $\Sigma^{\ge n}$).

Let $w \in \Sigma^*$. If $w = uv$ for some $u, v \in \Sigma^*$, we say that $u$ is a prefix of $w$ and $v$ is a suffix of $w$. Moreover, $u$ is a proper prefix (resp. $v$ is a proper suffix) of $w$ if $v \ne \varepsilon$ (resp. $u \ne \varepsilon$). A factor of $w$ is a prefix of a suffix of $w$ (or, equivalently, a suffix of a prefix). We denote by Pref($w$), Suff($w$), Fact($w$) respectively the set of prefixes, suffixes, factors of the word $w$.

The reversal of $w$ is the word $\tilde{w}$ obtained by writing the letters of $w$ in the reverse order. For example, the reversal of $w = 11212$ is $\tilde{w} = 21211$. The complement of $w$ is the word $\overline{w}$ obtained by swapping the letters of $w$, i.e., by changing the 1's into 2's and the 2's into 1's. For example, the complement of $w = 11212$ is $\overline{w} = 22121$.

A language over $\Sigma$ is a subset of $\Sigma^*$. For a finite language $L$ we denote by $|L|$ the number of its elements. A (finite or infinite) language $F \subseteq \Sigma^*$ is factorial if $F = \mathrm{Fact}(F)$, i.e., if for any $u, v \in \Sigma^*$ one has $uv \in F \Rightarrow u \in F$ and $v \in F$. A language $M \subseteq \Sigma^*$ is anti-factorial if no word in $M$ is a factor of another word in $M$, i.e., if for any $u, v \in M$, $u \ne v \Rightarrow u$ is not a factor of $v$.

The complement $F^c = \Sigma^* \setminus F$ of a factorial language $F$ is a (two-sided) ideal of $\Sigma^*$. Denoting by $\mathcal{MF}(F)$ the basis of this ideal, we have $F^c = \Sigma^* \mathcal{MF}(F) \Sigma^*$. The set $\mathcal{MF}(F)$ is an anti-factorial language and is called the set of minimal forbidden words for $F$.

The equations

$$F = \Sigma^* \setminus \Sigma^* \mathcal{MF}(F) \Sigma^*$$

and

$$\mathcal{MF}(F) = \Sigma F \cap F \Sigma \cap (\Sigma^* \setminus F)$$

hold for any factorial language $F$, and show that $\mathcal{MF}(F)$ is uniquely characterized by $F$ and vice versa. Equivalently, a word $v$ belongs to $\mathcal{MF}(F)$ iff the following two conditions hold:

• $v$ is forbidden, i.e., $v \notin F$,

• $v$ is minimal, i.e., both the prefix and the suffix of $v$ of length $|v| - 1$ belong to $F$.

For more details about minimal forbidden words the reader can see [2 [5] . 
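
The two conditions above can be checked by exhaustive search. The following Python sketch is an illustration only, not part of the original construction: it assumes that the factorial language is given by a membership predicate (the hypothetical argument `in_f`) and that the caller fixes a length bound `max_len`; the example language `no_cube_of_letter` is likewise chosen here for illustration.

```python
from itertools import product

def minimal_forbidden(in_f, alphabet="12", max_len=6):
    """Words v with |v| <= max_len such that v is not in F, while the
    prefix v[:-1] and the suffix v[1:] are both in F.
    F is described by the membership predicate in_f (an assumption)."""
    found = []
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            v = "".join(letters)
            if not in_f(v) and in_f(v[:-1]) and in_f(v[1:]):
                found.append(v)
    return found

# Example language: words over {1, 2} containing neither 111 nor 222.
def no_cube_of_letter(w):
    return "111" not in w and "222" not in w

print(minimal_forbidden(no_cube_of_letter))
```

The search is exponential in `max_len`, so it is only meant for small sanity checks.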

A deterministic automaton is a tuple $\mathcal{A} = (Q, \Sigma, i, T, \delta)$ where:

• $Q$ is the set of states,

• $\Sigma$ is the alphabet,

• $i \in Q$ is the initial state,

• $T \subseteq Q$ is the set of final (or accepting) states,

• $\delta : (Q \times \Sigma) \to Q$ is the transition function.

The extended transition function $\delta^* : (Q \times \Sigma^*) \to Q$ is the classical extension of $\delta$ to words over $\Sigma$. It is defined recursively by $\delta^*(q, \varepsilon) = q$ and $\delta^*(q, wa) = \delta(\delta^*(q, w), a)$, $w \in \Sigma^*$, $a \in \Sigma$. In what follows, we still use $\delta$ to denote the extended transition function.

A language $L \subseteq \Sigma^*$ is accepted (or recognized) by the automaton $\mathcal{A}$ if $L$ is the set of labels of paths in $\mathcal{A}$ starting at the initial state and ending in a final state. The language accepted by the automaton $\mathcal{A}$ is denoted by $L(\mathcal{A})$.

We say that a word $v \in \Sigma^*$ avoids the language $M \subseteq \Sigma^*$ if no word of $M$ is a factor of $v$. A language $L$ avoids $M$ if every word in $L$ avoids $M$. We denote by $L(M)$ the largest (factorial) language avoiding a given finite anti-factorial language $M$, i.e., the set of all the words of $\Sigma^*$ that do not contain any word of $M$ as a factor.

Lemma 2.1. Jlj/ The following equalities hold: 

• If $L$ is a factorial language, then $L(\mathcal{MF}(L)) = L$.

• If $M$ is an anti-factorial language, then $\mathcal{MF}(L(M)) = M$.

We recall here a construction introduced by Crochemore et al. [9] for obtaining the language $L(M)$ that avoids a given finite anti-factorial language $M$. For any anti-factorial language $M$, the algorithm L-AUTOMATON below builds a deterministic automaton $\mathcal{A}(M)$ recognizing the language $L(M)$.



L-AUTOMATON (trie 𝒯 = (Q, Σ, i, T, δ'))

    for each a ∈ Σ
        if δ'(i, a) defined
            set δ(i, a) = δ'(i, a);
            set s(δ(i, a)) = i;
        else
            set δ(i, a) = i;
    for each state p ∈ Q \ {i} in breadth-first search and each a ∈ Σ
        if δ'(p, a) defined
            set δ(p, a) = δ'(p, a);
            set s(δ(p, a)) = δ(s(p), a);
        else if p ∉ T
            set δ(p, a) = δ(s(p), a);
        else
            set δ(p, a) = p;
    return (Q, Σ, i, Q \ T, δ);



The input of L-AUTOMATON is the trie $\mathcal{T}$ recognizing the anti-factorial language $M$. Recall that a trie is a tree-like automaton for storing a set of words in which there is one node for every common prefix and in which the words are stored in the leaves. The output is a deterministic automaton $\mathcal{A}(M) = (Q, \Sigma, i, T, \delta)$ recognizing the language $L(M)$, where:

• the set $Q$ of states is the same set of states of the input trie $\mathcal{T}$, i.e., it corresponds to the prefixes of the words in $M$,

• $\Sigma$ is the alphabet,

• the initial state is the empty word $\varepsilon$,

• the set of terminal states is $Q \setminus M$, i.e., the proper prefixes of words in $M$.

States of $\mathcal{A}(M)$ that correspond to the words of $M$ are called sink states. The set of transitions defined by $\delta$, denoted by $E$, is partitioned into three (pairwise disjoint) sets $E_1$, $E_2$ and $E_3$, defined by:

• $E_1 = \{(u, x, ux) \mid ux \in Q,\ x \in \Sigma\}$ (called solid edges),

• $E_2 = \{(u, x, v) \mid u \in Q \setminus M,\ x \in \Sigma,\ ux \notin Q,\ v \text{ longest suffix of } ux \text{ in } Q\}$ (called weak edges),

• $E_3 = \{(u, x, u) \mid u \in M,\ x \in \Sigma\}$ (loops on sink states).

The algorithm makes use of a failure function, denoted by $s$, defined on the states of $Q$ different from $\varepsilon$. If $u \in Q$, then $s(u)$ is the state in $Q$ corresponding to the longest proper suffix of $u$ which is in $Q$, i.e., which is a proper prefix of some word in $M$. The failure function defines the weak edges (transitions in $E_2$). It follows from the construction that edges incoming in the same state are labeled by the same letter.
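
As an illustration of the procedure described above, here is a minimal Python sketch. It is a sketch under stated assumptions, not the authors' implementation: states are represented directly by the prefixes they correspond to (Python strings), the trie is built implicitly from the list `M`, and the names `l_automaton` and `accepts` are chosen here for convenience.

```python
from collections import deque

def l_automaton(M, alphabet="12"):
    """Build A(M) for a finite anti-factorial set M.
    States are the prefixes of the words of M; returns the transition
    table and the set of final (non-sink) states."""
    states = {w[:i] for w in M for i in range(len(w) + 1)}
    sinks = set(M)
    delta, fail = {}, {}
    for a in alphabet:
        if a in states:                  # solid edge leaving the root
            delta["", a] = a
            fail[a] = ""
        else:                            # loop on the root
            delta["", a] = ""
    queue = deque(a for a in alphabet if a in states)
    while queue:                         # breadth-first traversal
        p = queue.popleft()
        for a in alphabet:
            q = p + a
            if q in states:              # solid edge of the trie
                delta[p, a] = q
                fail[q] = delta[fail[p], a]
                queue.append(q)
            elif p not in sinks:         # weak edge, via the failure function
                delta[p, a] = delta[fail[p], a]
            else:                        # loop on a sink state
                delta[p, a] = p
    return delta, states - sinks

def accepts(delta, finals, w):
    """Run the automaton on w from the initial state (the empty word)."""
    state = ""
    for a in w:
        state = delta[state, a]
    return state in finals

delta, finals = l_automaton(["111", "222"])
print(accepts(delta, finals, "2211212"), accepts(delta, finals, "2111"))
```

Representing states as strings keeps the correspondence with prefixes of words in $M$ visible, at the cost of some memory; an integer-indexed trie would be the more usual choice for large inputs.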

Theorem 2.2. Jf^ For any anti-factorial language M, A(M) accepts the language L(M). 

Corollary 2.3. Let $L$ be a factorial language. If $M = \mathcal{MF}(L)$, then $\mathcal{A}(\mathcal{MF}(L))$ accepts $L$.

Remark 1. In what follows, we suppose that in $\mathcal{A}(M)$ we have pruned the sink states and all the transitions going to them. As a consequence, we only have two kinds of transitions: solid edges (those of the trie $\mathcal{T}$) and weak edges (those created by procedure L-AUTOMATON).

The automaton $\mathcal{A}(M)$ induces on $L$ a natural equivalence, defined by:

$$u \equiv v \iff \delta(\varepsilon, u) = \delta(\varepsilon, v),$$

i.e., $u$ and $v$ are equivalent iff they are the labels of two paths in $\mathcal{A}(M)$ starting at the initial state and ending in the same state. The equivalence class of a word $w \in L$ is denoted by $[w]$. Hence

$$[w] = \{v \in L : \delta(\varepsilon, v) = \delta(\varepsilon, w)\}.$$

Lemma 2.4. Let $u$ be a state of $\mathcal{A}(M)$. Let $v \in \Sigma^*$ be such that $\delta(\varepsilon, v) = u$. Then $u$ is the longest suffix of $v$ that is also a state of $\mathcal{A}(M)$, i.e., that is also a proper prefix of a word in $M$.

3. Differentiable words 

Let $w$ be a word over the alphabet $\Sigma$. Then $w$ can be uniquely written as a concatenation of maximal blocks of identical symbols (called runs), i.e., $w = x_1^{i_1} x_2^{i_2} \cdots x_n^{i_n}$, with $x_j \in \Sigma$ and $i_j > 0$. The run-length encoding of $w$, denoted by $\Delta(w)$, is the sequence of exponents $i_j$, i.e., one has $\Delta(w) = i_1 i_2 \cdots i_n$. The run-length encoding extends naturally to right-infinite words.

Definition 1. [3] A right-infinite word $W$ over $\Sigma$ is called a smooth word if for every integer $k > 0$ one has that $\Delta^k(W)$ is still a word over $\Sigma$.

The run-length encoding operator $\Delta$ on right-infinite words over the alphabet $\Sigma = \{1,2\}$ has two fixed points, namely the Kolakoski word

$$\mathcal{K} = 221121221221121122121121221121121221221121221211211221221121\cdots$$

and the word $1\mathcal{K}$.
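
The fixed-point property can be observed on finite prefixes. The Python sketch below is only an illustration: the function names are chosen here, and the generation loop is the usual self-describing construction of the Kolakoski word (each letter already written gives the length of a later run), which is an assumption of this example rather than something taken from the text.

```python
from itertools import groupby

def run_length_encoding(w):
    """Delta(w): the run lengths of w, written as a string of digits
    (single digits suffice here because runs have length 1 or 2)."""
    return "".join(str(len(list(run))) for _, run in groupby(w))

def kolakoski_prefix(n):
    """First n letters of the Kolakoski word beginning with 22."""
    seq = [2, 2]                # run 0 is "22"
    i = 1                       # index of the next run to write
    while len(seq) < n:
        letter = 1 if i % 2 == 1 else 2   # runs alternate 2, 1, 2, 1, ...
        seq.extend([letter] * seq[i])     # run i has length seq[i]
        i += 1
    return "".join(map(str, seq[:n]))

prefix = kolakoski_prefix(40)
# The last run may have been cut by the truncation, so it is dropped.
encoded = run_length_encoding(prefix)[:-1]
print(encoded == kolakoski_prefix(len(encoded)))
```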

We now give the definition and basic properties of $C^\infty$-words, which are the factors of smooth words.






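Definition 2. [10] A word $w \in \Sigma^*$ is differentiable if $\Delta(w)$ is still a word over $\Sigma$.

Remark 2. Since $\Sigma = \{1, 2\}$, we have that $w$ is differentiable if and only if neither 111 nor 222 appears in $w$.

Definition 3. [10] The derivative is the function $D$ defined on the differentiable words by:

$$D(w) = \begin{cases} \varepsilon & \text{if } \Delta(w) = 1 \text{ or } w = \varepsilon, \\ \Delta(w) & \text{if } \Delta(w) = 2x2 \text{ or } \Delta(w) = 2, \\ x2 & \text{if } \Delta(w) = 1x2, \\ 2x & \text{if } \Delta(w) = 2x1, \\ x & \text{if } \Delta(w) = 1x1. \end{cases}$$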



In other words, the derivative of a differentiable word $w$ is the run-length encoding of the word obtained by discarding the first and/or the last run of $w$ if these have length 1.
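
This description of the derivative translates directly into code. The following Python sketch is illustrative: words are represented as strings over the characters `1` and `2`, and returning `None` for a non-differentiable word is a convention chosen here.

```python
from itertools import groupby

def derivative(w):
    """Return D(w) as a string, or None if w is not differentiable."""
    runs = [len(list(run)) for _, run in groupby(w)]
    if any(r > 2 for r in runs):
        return None              # Delta(w) is not a word over {1, 2}
    if runs and runs[0] == 1:
        runs = runs[1:]          # discard a first run of length 1
    if runs and runs[-1] == 1:
        runs = runs[:-1]         # discard a last run of length 1
    return "".join(map(str, runs))

for w in ("2211", "22112112", "21212", "111"):
    print(w, derivative(w))
```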

Remark 3. Let $u, v$ be two differentiable words. If $u$ is a factor (resp. a prefix, resp. a suffix) of $v$, then $D(u)$ is a factor (resp. a prefix, resp. a suffix) of $D(v)$. Conversely, for any factor (resp. prefix, resp. suffix) $z$ of $D(u)$, there exists a factor (resp. prefix, resp. suffix) $z'$ of $u$ such that $D(z') = z$.

Let $k > 0$. A word $w \in \Sigma^*$ is $k$-differentiable if $D^k(w)$ is defined. Here and in the rest of the paper, we use the convention that $D^0(w) = w$. By Remark 2, a word $w$ is $k$-differentiable if and only if for every $0 \le j < k$ the word $D^j(w)$ contains neither 111 nor 222 as a factor. Note that if a word is $k$-differentiable, then it is also $j$-differentiable for every $0 \le j < k$.

We denote by $C^k$ the set of $k$-differentiable words, and by $C^\infty$ the set of words which are differentiable arbitrarily many times. A word in $C^\infty$ is also called a $C^\infty$-word. Clearly, $C^\infty = \bigcap_{k>0} C^k$. So, for any smooth word $W$ over $\Sigma = \{1,2\}$, we have that Fact($W$) $\subseteq C^\infty$. Nevertheless, it is an open question whether there exists a smooth word $W$ such that Fact($W$) $= C^\infty$.
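
Membership in $C^k$ and in $C^\infty$ can be tested by iterating the derivative. The Python sketch below is an illustration with names chosen here; the loop in `is_c_infinity` stops because the derivative of a non-empty differentiable word is strictly shorter than the word itself.

```python
from itertools import groupby

def derivative(w):
    """D(w) as a string, or None if w is not differentiable."""
    runs = [len(list(run)) for _, run in groupby(w)]
    if any(r > 2 for r in runs):
        return None
    if runs and runs[0] == 1:
        runs = runs[1:]
    if runs and runs[-1] == 1:
        runs = runs[:-1]
    return "".join(map(str, runs))

def is_k_differentiable(w, k):
    """True iff D^k(w) is defined."""
    for _ in range(k):
        w = derivative(w)
        if w is None:
            return False
    return True

def is_c_infinity(w):
    """True iff w can be differentiated down to the empty word."""
    while w:
        w = derivative(w)
        if w is None:
            return False
    return True

print(is_k_differentiable("21212", 1), is_k_differentiable("21212", 2))
print(is_c_infinity("2211"), is_c_infinity("12121"))
```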

The following proposition is a direct consequence of the definitions above.

Proposition 3.1. The set $C^\infty$ and the sets $C^k$, for any $k > 0$, are factorial languages closed under reversal and complement.

Definition 4. [10] A primitive of a word $w$ is any word $w'$ such that $D(w') = w$.

It is easy to see that any $C^\infty$-word has at least two and at most eight distinct primitives. For example, the word $w = 2$ has eight primitives, namely 11, 22, 211, 112, 2112, 122, 221 and 1221, whereas the word $w = 1$ has only two primitives, namely 121 and 212. The empty word $\varepsilon$ has four primitives: 1, 2, 12 and 21. However, any $C^\infty$-word admits exactly two primitives of minimal (maximal) length, one being the complement of the other.
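
The primitives listed above can be recovered by exhaustive search. The following Python sketch is only an illustration: it assumes that primitives are non-empty words, and it uses the bound $2|w| + 2$ on their length (each letter of $w$ encodes a run of length at most 2, plus at most one discarded run of length 1 at each end), which is an observation made here rather than a statement from the text.

```python
from itertools import groupby, product

def derivative(w):
    """D(w) as a string, or None if w is not differentiable."""
    runs = [len(list(run)) for _, run in groupby(w)]
    if any(r > 2 for r in runs):
        return None
    if runs and runs[0] == 1:
        runs = runs[1:]
    if runs and runs[-1] == 1:
        runs = runs[:-1]
    return "".join(map(str, runs))

def primitives(w):
    """All non-empty words v over {1, 2} with D(v) = w, shortest first."""
    result = []
    for n in range(1, 2 * len(w) + 3):       # lengths 1 .. 2|w| + 2
        for letters in product("12", repeat=n):
            v = "".join(letters)
            if derivative(v) == w:
                result.append(v)
    return result

print(primitives("2"))
print(primitives("1"))
print(primitives(""))
```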

Definition 5. [16] The height of a $C^\infty$-word $w$ is the least integer $k$ such that $D^k(w) = \varepsilon$.

We introduce the following definitions, which will play a central role in the rest of the paper.

Definition 6. Let $w$ be a $C^\infty$-word of height $k$. The root of $w$ is $D^{k-1}(w)$. Therefore, the root of $w$ belongs to {1, 2, 12, 21}. Consequently, $w$ is said to be single-rooted if its root has length 1 or double-rooted if its root has length 2.

Example 1. Let $w = 2211$. Since $D(w) = 22$, $D^2(w) = D(D(w)) = 2$ and $D^3(w) = \varepsilon$, we have that $w$ has height 3 and root 2, therefore it is a single-rooted word. Let $w' = 22112112$. Since $D(w') = 2212$, $D^2(w') = 21$, $D^3(w') = \varepsilon$, we have that $w'$ has height 3 and root 21, therefore it is a double-rooted word.
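
Height and root can be computed together by iterating the derivative until the empty word is reached. The Python sketch below is illustrative; the function name and the choice of raising `ValueError` on words outside $C^\infty$ are conventions adopted here.

```python
from itertools import groupby

def derivative(w):
    """D(w) as a string, or None if w is not differentiable."""
    runs = [len(list(run)) for _, run in groupby(w)]
    if any(r > 2 for r in runs):
        return None
    if runs and runs[0] == 1:
        runs = runs[1:]
    if runs and runs[-1] == 1:
        runs = runs[:-1]
    return "".join(map(str, runs))

def height_and_root(w):
    """Return (k, D^(k-1)(w)) where k is the height of the
    non-empty C-infinity word w."""
    k, previous = 0, None
    while w:
        previous, w = w, derivative(w)
        if w is None:
            raise ValueError("not a C-infinity word")
        k += 1
    return k, previous

print(height_and_root("2211"))
print(height_and_root("22112112"))
```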

Definition 7. Let $w$ be a $C^\infty$-word of height $k > 1$. We say that $w$ is maximal (resp. minimal) if for every $0 \le j \le k-2$, $D^j(w)$ is a primitive of $D^{j+1}(w)$ of maximal (resp. minimal) length. The words of height $k = 1$ are assumed to be at the same time maximal and minimal.

Definition 8. We say that a $C^\infty$-word $w$ is right maximal (resp. left maximal) if $w$ is a suffix (resp. a prefix) of a maximal word. Analogously, we say that $w$ is right minimal (resp. left minimal) if $w$ is a suffix (resp. a prefix) of a minimal word.

Clearly, a word is maximal (resp. minimal) if and only if it is both left maximal and right maximal (resp. left minimal and right minimal).

Example 2. The word 2211 is minimal, since 2211 is a primitive of 22 of minimal length and 22 is a primitive of 2 of minimal length; the word 21221121 is maximal, since 21221121 is a primitive of 1221 of maximal length and 1221 is a primitive of 2 of maximal length; the word 2122112 is left maximal but not right maximal. Note that 2211 is a proper factor of 21221121 and that the two words have the same height and the same root.

    w         2211    2122112    21221121
    D(w)      22      122        1221
    D^2(w)    2       2          2



Any $C^\infty$-word can be extended to the left and to the right into a $C^\infty$-word [16]. That is, if $w$ is a $C^\infty$-word, then at least one of $1w$ and $2w$ is a $C^\infty$-word. Analogously, at least one of $w1$ and $w2$ is a $C^\infty$-word.

Definition 9. [16] A $C^\infty$-word $w$ is right doubly extendable (resp. left doubly extendable) if both $w1$ and $w2$ (resp. $1w$ and $2w$) are $C^\infty$-words. Otherwise, $w$ is right simply extendable (resp. left simply extendable). A $C^\infty$-word $w$ is fully extendable if $1w1$, $1w2$, $2w1$ and $2w2$ are all $C^\infty$-words.

It is worth noticing that a word can be at the same time right doubly extendable and left doubly extendable but not fully extendable. This is the case, for example, for the word $w = 1$.
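
The three notions of extendability can be tested mechanically. The Python sketch below is an illustration with function names chosen here; it simply checks the candidate extensions for membership in $C^\infty$.

```python
from itertools import groupby

def derivative(w):
    """D(w) as a string, or None if w is not differentiable."""
    runs = [len(list(run)) for _, run in groupby(w)]
    if any(r > 2 for r in runs):
        return None
    if runs and runs[0] == 1:
        runs = runs[1:]
    if runs and runs[-1] == 1:
        runs = runs[:-1]
    return "".join(map(str, runs))

def is_c_infinity(w):
    while w:
        w = derivative(w)
        if w is None:
            return False
    return True

def right_doubly_extendable(w):
    return is_c_infinity(w + "1") and is_c_infinity(w + "2")

def left_doubly_extendable(w):
    return is_c_infinity("1" + w) and is_c_infinity("2" + w)

def fully_extendable(w):
    return all(is_c_infinity(x + w + y) for x in "12" for y in "12")

for w in ("1", "12", "2211"):
    print(w, left_doubly_extendable(w), right_doubly_extendable(w),
          fully_extendable(w))
```

For $w = 1$ the first two tests succeed while the third fails, which is the situation described above.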

A remarkable result of Weakley [16] is presented in the next theorem, which we slightly adapted to our definitions.

Theorem 3.2. Let $w$ be a $C^\infty$-word. The following three conditions are equivalent:

1. $w$ is fully extendable (resp. $w$ is right doubly extendable, resp. $w$ is left doubly extendable);

2. $w$ is double-rooted maximal (resp. $w$ is right maximal, resp. $w$ is left maximal);

3. $w$ and all its derivatives (resp. $w$ and all its derivatives longer than one) begin and end (resp. end, resp. begin) with two distinct symbols.

Example 3. Consider the $C^\infty$-word $w = 121$. By Theorem 3.2, $w$ is right doubly extendable and left doubly extendable. Nevertheless, $w$ is not fully extendable, since it is single-rooted. Indeed, the word $2w2$ is not a $C^\infty$-word, since $D(2w2) = D(21212) = 111$ and thus by definition $D(2w2)$ is not differentiable.

Remark 4. A $C^\infty$-word $w$ is right minimal (resp. left minimal) if and only if $w$ and all its derivatives longer than two have the property that their suffix (resp. their prefix) of length three is different from 221 and 112 (resp. different from 122 and 211). To see this, think for example of a $C^\infty$-word of the form $w = w'221$ for some $w'$; then the word $w'22$ is a primitive of $D(w)$ shorter than $w$. Hence $w$, or any of its primitives, cannot be a right minimal word.

Lemma 3.3. Let $w$ be a $C^\infty$-word. Then $w$ is a right maximal word (resp. a left maximal word) if and only if there exists $x \in \Sigma$ such that $wx$ (resp. $xw$) is a right minimal word (resp. a left minimal word).

Moreover, if $wx$ (resp. $xw$) is a right minimal word (resp. a left minimal word), then so is $w\overline{x}$ (resp. $\overline{x}w$).

Proof. Let $w \in C^\infty$ be a right maximal word. Since, by Theorem 3.2, $w$ and all its derivatives longer than one end with two different symbols, Remark 4 proves that the words $wx$ and $w\overline{x}$, $x \in \Sigma$, are both right minimal.

Conversely, if $wx$, $x \in \Sigma$, is a right minimal word, Remark 4 directly shows that $w$ and all its derivatives longer than one end with two different symbols. Hence, again by Theorem 3.2, $w$ is right maximal.

In particular, this also shows that $w\overline{x}$ is a $C^\infty$-word (since, again by Theorem 3.2, $w$ is right doubly extendable), and the argument above shows that $w\overline{x}$ is a right minimal word.

The same argument can be used for left maximal words. □

Definition 10. Let $w$ be a $C^\infty$-word. A right simple extension of $w$ is any $C^\infty$-word $w'$ of the form $w' = w x_1 x_2 \cdots x_n$, $x_i \in \Sigma$, $n \ge 1$, such that, for every $1 \le i \le n$, $w x_1 \cdots x_{i-1} \overline{x_i} \notin C^\infty$. A left simple extension of $w$ is any $C^\infty$-word $w'$ such that $\tilde{w'}$ is a right simple extension of $\tilde{w}$. A simple extension of $w$ is a right simple extension of a left simple extension of $w$ (or equivalently a left simple extension of a right simple extension of $w$).

The right maximal extension (resp. the left maximal extension, resp. the maximal extension) of $w$ is the right simple extension (resp. the left simple extension, resp. the simple extension) of $w$ of maximal length.

Example 4. Let $w = 2211$, as in the previous examples. Then 221121 is the right maximal extension of $w$ and 212211 is the left maximal extension of $w$. The maximal extension of $w$ is 21221121.

    w         2211    221121    212211    21221121
    D(w)      22      221       122       1221
    D^2(w)    2       2         2         2
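
Maximal extensions can be computed greedily: as long as exactly one letter extends the word inside $C^\infty$, append (or prepend) it. The Python sketch below is an illustration under two assumptions made here: the input is a $C^\infty$-word, and a word that is already doubly extendable is returned unchanged. Termination relies on the existence of a simple extension of maximal length, as stated in the definition.

```python
from itertools import groupby

def derivative(w):
    """D(w) as a string, or None if w is not differentiable."""
    runs = [len(list(run)) for _, run in groupby(w)]
    if any(r > 2 for r in runs):
        return None
    if runs and runs[0] == 1:
        runs = runs[1:]
    if runs and runs[-1] == 1:
        runs = runs[:-1]
    return "".join(map(str, runs))

def is_c_infinity(w):
    while w:
        w = derivative(w)
        if w is None:
            return False
    return True

def right_maximal_extension(w):
    while True:
        candidates = [w + x for x in "12" if is_c_infinity(w + x)]
        if len(candidates) != 1:     # doubly extendable: stop here
            return w
        w = candidates[0]

def left_maximal_extension(w):
    while True:
        candidates = [x + w for x in "12" if is_c_infinity(x + w)]
        if len(candidates) != 1:
            return w
        w = candidates[0]

def maximal_extension(w):
    return right_maximal_extension(left_maximal_extension(w))

print(right_maximal_extension("2211"), left_maximal_extension("2211"),
      maximal_extension("2211"))
```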



Remark 5. Let $w$ be a $C^\infty$-word. Then the right maximal extension (resp. the left maximal extension, resp. the maximal extension) of $w$ is a right maximal (resp. a left maximal, resp. a maximal) word.

Lemma 3.4. Let $w$ be a $C^\infty$-word. Then every simple extension of $w$ has the same height and the same root as $w$. In particular, this holds for the maximal extension of $w$.

Proof. By induction on the height $k$ of $w$. For $k = 1$, a simple check of all the cases proves that the claim holds.

Let $u$ be a word of height $k > 1$ and let $v$ be a simple extension of $u$. We claim that the word $D(v)$ is a simple extension of the word $D(u)$. Indeed, by Remark 3, $D(u)$ is a (proper) factor of $D(v)$. The existence of a non-simple extension $z$ of $D(u)$ such that $z$ is a factor of $D(v)$ would imply, once again by Remark 3, the existence of a non-simple extension $z'$ of $u$ such that $z'$ is a factor of $v$, against the hypothesis that $v$ is a simple extension of $u$. Hence, by the induction hypothesis, $D(v)$ and $D(u)$ have the same height and root. Since a word has the same root as its derivative, and has height equal to 1 plus the height of its derivative, the claim is proved. □

Lemma 3.5. Let $w \in C^\infty$ be a right maximal word (resp. a left maximal word) of height $k > 0$. Then for every $0 \le j < k$ and for every $x \in \Sigma$, $|D^j(wx)| = |D^j(w)| + 1$ (resp. $|D^j(xw)| = |D^j(w)| + 1$). Moreover, if $D^j(wx) = D^j(w)y$, $y \in \Sigma$, then $D^j(w\overline{x}) = D^j(w)\overline{y}$ (resp. if $D^j(xw) = yD^j(w)$, $y \in \Sigma$, then $D^j(\overline{x}w) = \overline{y}D^j(w)$).

Proof. Let $w \in C^\infty$ be a right maximal word. Then, by definition, the last run of $w$ has length one, so it is a letter $x \in \Sigma$. We have $D(wx) = D(w)2$ and $D(w\overline{x}) = D(w)1$. Since the derivative of a right maximal word is a right maximal word (by Theorem 3.2), the claim follows.

The same argument can be used for left maximal words. □

The results contained in this section can be summarized as follows. Let $w \in C^\infty$ be a right maximal word (resp. a left maximal word) of height $k > 0$. Then:

• if $w$ is double-rooted, then $w1$ and $w2$ (resp. $1w$ and $2w$) are single-rooted right minimal (resp. left minimal) words of height $k + 1$;

• if instead $w$ is single-rooted, then there exists $x \in \Sigma$ such that $wx$ (resp. $xw$) is a single-rooted right minimal (resp. left minimal) word and has height $k + 1$, whereas $w\overline{x}$ (resp. $\overline{x}w$) is a double-rooted right minimal (resp. left minimal) word and has height $k$.

Indeed, if $w$ is double-rooted, and its root is equal to $y\overline{y}$, $y \in \Sigma$, then, by Lemma 3.5, there exists $x \in \Sigma$ such that $D^{k-1}(wx) = y\overline{y}y$ and $D^{k-1}(w\overline{x}) = y\overline{y}\,\overline{y}$. Thus, $wx$ and $w\overline{x}$ are single-rooted words of height $k + 1$ (more precisely, the root of $wx$ is 1 and the root of $w\overline{x}$ is 2).

If instead $w$ is single-rooted, and its root is equal to $y \in \Sigma$, then, by Lemma 3.5, we have that for a letter $x \in \Sigma$ the word $wx$ is such that $D^{k-1}(wx) = yy$, and so $wx$ has height $k+1$ and root 2, whereas $D^{k-1}(w\overline{x}) = y\overline{y}$, and so $w\overline{x}$ is a double-rooted word of height $k$.

The same argument can be used for left maximal words and extensions to the left.

Example 5. The word $w = 221121$ is a single-rooted right maximal word of height 3. The word $w1$ is a double-rooted right minimal word of height 3, whereas the word $w2$ is a right minimal word of height 4 and root 2.

    w         221121    2211211    2211212
    D(w)      221       2212       2211
    D^2(w)    2         21         22
    D^3(w)                         2

Consider now the word $w' = 22112112$, the right maximal extension of the word $w1$. The word $w'$ is a double-rooted right maximal word of height 3. The word $w'1$ is a right minimal word of height 4 and root 2, whereas the word $w'2$ is a right minimal word of height 4 and root 1.

    w'         22112112    221121121    221121122
    D(w')      2212        22121        22122
    D^2(w')    21          211          212
    D^3(w')                2            1



4. Automata for differentiable words 

We denote respectively by $\mathcal{MF}(C^k)$ and $\mathcal{MF}(C^\infty)$ the set of minimal forbidden words for the set $C^k$ and the set of minimal forbidden words for the set $C^\infty$. Clearly, $\mathcal{MF}(C^\infty) = \bigcup_{k>0} \mathcal{MF}(C^k)$.

Remark 6. It follows from the definition that a word $w = xuy$, $x, y \in \Sigma$, $u \in \Sigma^*$, belongs to $\mathcal{MF}(C^\infty)$ if and only if

1. $xuy$ does not belong to $C^\infty$;

2. both $xu$ and $uy$ belong to $C^\infty$.

Since a $C^\infty$-word is always extendable to the left and to the right, the second condition is equivalent to: both $xu\overline{y}$ and $\overline{x}uy$ belong to $C^\infty$. In particular, this shows that $u$ is left doubly extendable and right doubly extendable, but not fully extendable, since otherwise $xuy$ would belong to $C^\infty$. Hence, by Theorem 3.2, $u$ is a (single-rooted) maximal word.

The following proposition is a consequence of the definition. 

Proposition 4.1. The set $\mathcal{MF}(C^\infty)$ and the sets $\mathcal{MF}(C^k)$, for any $k > 0$, are anti-factorial languages closed under reversal and complement.

We now give a combinatorial description of the sets of minimal forbidden words for the set of differentiable words. Let us consider first the set $\mathcal{MF}(C^\infty)$.

Lemma 4.2. Let $w \in \mathcal{MF}(C^\infty)$. Then there exists $k \ge 0$ such that $D^k(w) = 111$ or $D^k(w) = 222$.

Proof. By assumption, $w \notin C^\infty$. So there exists $k \ge 0$ such that $D^k(w)$ contains $xxx$ as a factor, for a letter $x \in \Sigma$. If $xxx$ were a proper factor of $D^k(w)$, then, by Remark 3, there would exist a proper factor of $w$ which is not a $C^\infty$-word, against the definition of minimal forbidden word. □

So, analogously to the case of $C^\infty$-words, we can define the height of a word $w$ in $\mathcal{MF}(C^\infty)$. This is the integer $k + 1$ such that $D^k(w) = xxx$, for $x \in \Sigma$.

Not surprisingly, the set $\mathcal{MF}(C^k)$ of minimal forbidden words for the set $C^k$ coincides with the set of words in $\mathcal{MF}(C^\infty)$ having height not greater than $k$, as shown in the following lemma.

Lemma 4.3. For any $k > 0$, the subset of $\mathcal{MF}(C^\infty)$ of words having height less than or equal to $k$ is the set $\mathcal{MF}(C^k)$.

Proof. By induction on $k$. Let first $k = 1$. We have to prove that the set of minimal forbidden words for the set of 1-differentiable words is equal to the set of minimal forbidden words of height 1. By definition, a minimal forbidden word of height 1 is a word $w$ such that $w$ is not differentiable but every proper factor of $w$ is. This directly leads to $\mathcal{MF}(C^1) = \{111, 222\}$, and proves the basis step of the induction.

Suppose now that the claim holds true for $k \ge 1$. By definition, $\mathcal{MF}(C^{k+1})$ is the set of words $w$ such that $w$ is not $(k+1)$-differentiable, but every proper factor of $w$ is. Clearly, since a $(k+1)$-differentiable word is also a $k$-differentiable word, we have $\mathcal{MF}(C^k) \subseteq \mathcal{MF}(C^{k+1})$. By the induction hypothesis, $\mathcal{MF}(C^k)$ is the set of minimal forbidden words of height less than or equal to $k$. It remains to prove that every word in $\mathcal{MF}(C^{k+1}) \setminus \mathcal{MF}(C^k)$ has height equal to $k + 1$. Let $w \in \mathcal{MF}(C^{k+1}) \setminus \mathcal{MF}(C^k)$. Then $w$ is $k$-differentiable but not $(k+1)$-differentiable. Then, by definition, $D^k(w)$ contains $xxx$ as a factor, for some letter $x \in \Sigma$. By the minimality of $w$, it follows that $D^k(w) = xxx$, and hence $w$ has height $k + 1$. □

The following lemma gives a constructive characterization of the sets $\mathcal{MF}(C^k)$.

Lemma 4.4. Let $k > 0$ and let $P_{k+1}$ be the set of words $v$ such that $v$ is a primitive of minimal length of $u$ and $u$ is a minimal forbidden word of height $k$. Then one has

$$\mathcal{MF}(C^{k+1}) = \mathcal{MF}(C^k) \cup P_{k+1}.$$



Proof. By Lemma 4.3, the set $\mathcal{MF}(C^{k+1}) \setminus \mathcal{MF}(C^k)$ is the set of minimal forbidden words having height
equal to $k + 1$, so its elements are primitives of words in $\mathcal{MF}(C^k)$. By minimality, they must be primitives
of minimal length. □
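The construction of Lemma 4.4 can be sketched in code. The sketch below rests on an assumption that is not spelled out in the text: a primitive of minimal length of $u$ is obtained by reading $u$ as a list of run lengths, adding one boundary run of length 1 wherever $u$ starts or ends with the symbol 1 (otherwise that run would be discarded by $D$), and then alternating letters starting from either 1 or 2. The function names are hypothetical.

```python
def shortest_primitives(u):
    """Return the two shortest words v with D(v) = u (assumed construction)."""
    lengths = [int(c) for c in u]
    # A run of length 1 at either end would be discarded by D,
    # so it must be shielded by one extra boundary run of length 1.
    if lengths[0] == 1:
        lengths.insert(0, 1)
    if lengths[-1] == 1:
        lengths.append(1)
    primitives = []
    for first, second in (("1", "2"), ("2", "1")):
        runs = [(first if i % 2 == 0 else second) * n
                for i, n in enumerate(lengths)]
        primitives.append("".join(runs))
    return primitives


def minimal_forbidden_sets(k):
    """Build MF(C^k) level by level, starting from MF(C^1) = {111, 222}."""
    level = ["111", "222"]
    total = set(level)
    for _ in range(k - 1):
        level = [v for u in level for v in shortest_primitives(u)]
        total.update(level)
    return total
```

With this sketch, `shortest_primitives("111")` gives `["12121", "21212"]`, and each level has twice as many words as the previous one.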

Remark 7. Since every minimal forbidden word of height $k$ gives exactly two minimal forbidden words of
height $k + 1$ (one being the complement of the other) we have, for any $k > 0$, $|\mathcal{MF}(C^k)| = \sum_{i=1}^{k} 2^i = 2^{k+1} - 2$.

The sets $\mathcal{MF}(C^k)$, for the first values of $k$, are reported in Table 1.

Another characterization of minimal forbidden words is the following.

Lemma 4.5. The word $xuy$, $x, y \in \Sigma$, $u \in \Sigma^*$, belongs to $\mathcal{MF}(C^\infty)$ if and only if the word $\overline{x}u\overline{y}$ is a minimal
$C^\infty$-word and has root 1.



Proof. Suppose that $xuy$, $x, y \in \Sigma$, $u \in \Sigma^*$, belongs to $\mathcal{MF}(C^\infty)$. Then, by Lemma 4.2, there exists $j$ such that $D^j(xuy) = zzz$ for a $z \in \Sigma$. By Remark 6, $u$ is a single-rooted maximal word. By Lemma 3.5 then, $D^j(\overline{x}u\overline{y}) = \overline{z}z\overline{z}$, and thus $D^{j+1}(\overline{x}u\overline{y}) = 1$. Finally, $\overline{x}u\overline{y}$ is a minimal word since $u$ is a maximal word (Lemma 3.3).

Conversely, let $\overline{x}u\overline{y}$ be a minimal $C^\infty$-word of root 1. Hence there exists $j \ge 0$ such that $D^j(\overline{x}u\overline{y}) = z\overline{z}z$, for a $z \in \Sigma$. By Lemma 3.5 we have that $D^j(xuy) = \overline{z}\,\overline{z}\,\overline{z}$, so $xuy \notin C^\infty$. Moreover, since $\overline{x}u\overline{y}$ is a minimal word, the word $u$ is maximal (Lemma 3.3), and then, by Theorem 3.2, $u$ is left doubly extendable and right doubly extendable. Therefore, both $uy$ and $xu$ are $C^\infty$-words, proving thus that $xuy$ is a minimal forbidden word. □





$\mathcal{MF}(C^1)$    $\mathcal{MF}(C^2)$    $\mathcal{MF}(C^3)$
111                    111                    111
222                    222                    222
                       21212                  21212
                       12121                  12121
                       112211                 112211
                       221122                 221122
                                              11211211
                                              22122122
                                              212212212
                                              121121121
                                              2121122121
                                              1212211212
                                              1122121122
                                              2211212211

Table 1: The sets of minimal forbidden words for $C^1$, $C^2$ and $C^3$.
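The entries of Table 1 can be cross-checked by brute force, directly from the definition of minimal forbidden word. The sketch below is only a sanity check under stated assumptions: words are Python strings over `1` and `2`, the search is capped at a hypothetical bound `max_len`, and minimality is tested on the two factors obtained by deleting the first or the last letter, which suffices because the set of $k$-differentiable words is closed under taking factors.

```python
from itertools import groupby, product


def derivative(w):
    """Return D(w), or None if w is not differentiable (some run has length >= 3)."""
    run_lengths = [len(list(group)) for _, group in groupby(w)]
    if any(n > 2 for n in run_lengths):
        return None
    if run_lengths and run_lengths[0] == 1:
        run_lengths = run_lengths[1:]
    if run_lengths and run_lengths[-1] == 1:
        run_lengths = run_lengths[:-1]
    return "".join(str(n) for n in run_lengths)


def is_k_differentiable(w, k):
    """True if the derivative can be applied k times to w."""
    for _ in range(k):
        w = derivative(w)
        if w is None:
            return False
    return True


def minimal_forbidden(k, max_len):
    """All minimal forbidden words for C^k of length at most max_len."""
    found = set()
    for n in range(1, max_len + 1):
        for letters in product("12", repeat=n):
            w = "".join(letters)
            if (not is_k_differentiable(w, k)
                    and is_k_differentiable(w[1:], k)
                    and is_k_differentiable(w[:-1], k)):
                found.add(w)
    return found
```

Running `minimal_forbidden(3, 10)` enumerates 2046 candidate words and returns the fourteen words of the last column of Table 1.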



Lemma 4.6. A $C^\infty$-word is a proper prefix of some word in $\mathcal{MF}(C^\infty)$ if and only if it is a left minimal word.



Proof. Suppose that $v$ is a proper prefix of $w = xuy \in \mathcal{MF}(C^\infty)$, $x, y \in \Sigma$. The word $u$ is a maximal word (Remark 6). By Lemma 3.3, $xu$ is then a left minimal word, and thus $v$, which is a prefix of $xu$, is a left minimal word. So the direct part of the statement is proved.

Conversely, let $v$ be a left minimal word. We prove that $v$ is a proper prefix of a minimal forbidden word by induction on $n = |v|$. The words $v = 1$ and $v = 2$ are proper prefixes respectively of 111 and 222, both belonging to $\mathcal{MF}(C^\infty)$. So suppose that the claim holds true for every left minimal word of length smaller than $n > 1$ and let $v$ be a left minimal word of length $n$. Consider the word $D(v)$. By Theorem 3.2, $D(v)$ is a left minimal word. Since $|D(v)| < |v|$, by the inductive hypothesis $D(v)$ is a proper prefix of some minimal forbidden word $w$. The two shortest primitives of $w$ are minimal forbidden words (by Lemma 4.4). Denote them by $w'$ and $w''$. By the direct part of the statement, the proper prefixes of $w'$ and $w''$ are left minimal words. Now, $v$ must be a prefix of either $w'$ or $w''$, and this completes the proof. □

For every $k > 0$, let $\mathcal{T}(k)$ be the trie recognizing the anti-factorial language $\mathcal{MF}(C^k)$. We denote by
$\mathcal{A}(k)$ the automaton constructed by the procedure L-AUTOMATON on input $\mathcal{T}(k)$.

Theorem 4.7. For every $k > 0$, $\mathcal{A}(k)$ is a deterministic automaton recognizing the language $C^k$.

Proof. The automaton $\mathcal{A}(k)$ is deterministic by construction. The fact that $\mathcal{A}(k)$ recognizes the language $C^k$ is a direct consequence of Corollary 2.3 and Lemma 4.4. □

Since, by Lemma 4.4, the construction of $\mathcal{MF}(C^{k+1})$ from $\mathcal{MF}(C^k)$ is effective, we can inductively extend the construction of the automaton $\mathcal{A}(k)$ to the case $k = \infty$. For this, consider the infinite trie $\mathcal{T}_\infty$ corresponding to the set $\mathcal{MF}(C^\infty)$. Procedure L-AUTOMATON on input $\mathcal{T}_\infty$ gives an infinite automaton $\mathcal{A}_\infty$ recognizing the words in $C^\infty$, in the sense that any word in $C^\infty$ is the label of a unique path in $\mathcal{A}_\infty$ starting at the initial state.

Let $\delta_{\mathcal{A}}$ denote the transition function of $\mathcal{A}_\infty$. As we already mentioned in Section 2, the automaton $\mathcal{A}_\infty$ induces a natural equivalence on $C^\infty$, defined by

$$u \equiv_{\mathcal{A}} v \iff \delta_{\mathcal{A}}(\varepsilon, u) = \delta_{\mathcal{A}}(\varepsilon, v).$$

The class of a word $u$ with respect to the equivalence above will be denoted by $[u]_{\mathcal{A}}$. Let $u$ be a state of $\mathcal{A}_\infty$. Since $u$ is the shortest element of its class $[u]_{\mathcal{A}}$, we have that $u$ is a proper prefix of a word in $\mathcal{MF}(C^\infty)$, and thus, by Lemma 4.6, $u$ is a left minimal word.



Proposition 4.8. Let $u$ be a state of $\mathcal{A}_\infty$. Let $v \in C^\infty$. Then $u \equiv_{\mathcal{A}} v$ if and only if $v$ is a left simple
extension of $u$.

Proof. Let $v \in C^\infty$. By construction, the state $u = \delta_{\mathcal{A}}(\varepsilon, v)$ is the longest suffix of $v$ which is also a state of $\mathcal{T}_\infty$, and so, by Lemma 4.6, $u$ is the longest suffix of $v$ which is a left minimal word. Suppose by contradiction that $v$ is not a left simple extension of $u$. This implies that there exists a suffix $u'$ of $v$ such that $|u'| \ge |u|$ and $1u'$ and $2u'$ are both $C^\infty$-words. By Theorem 3.2, $u'$ is a left maximal word. By Lemma 3.3, this would imply that $v$ has a suffix $xu'$, $x \in \Sigma$, which is a left minimal word longer than $u$. The contradiction then comes from Lemma 2.4 and Lemma 4.6. □



Corollary 4.9. Let $u$ be a left minimal word. Then

$$[u]_{\mathcal{A}} = \{ w \in \mathrm{Suff}(v) : |w| \ge |u| \}, \text{ where } v \text{ is the left maximal extension of } u.$$

We now describe the transitions of $\mathcal{A}_\infty$. Let $u$ and $v$ be two states of $\mathcal{A}_\infty$. By Lemma 4.6, $u$ and $v$ are
left minimal words. Let $x \in \Sigma$. If $(u, x, v)$ is a solid edge, then clearly $v = ux$, by definition. The weak edges,
created by procedure L-AUTOMATON, are instead characterized by the following proposition.

Proposition 4.10. Let $u$ and $v$ be two states of $\mathcal{A}_\infty$ and let $x \in \Sigma$. If the transition $(u, x, v)$ is a weak edge,
then:

1. $u$ is a left minimal and right maximal word and is double-rooted;

2. $v$ is a minimal word and has root 2.



Proof. By Lemma 4.6, $u$ and $v$ are left minimal words. By procedure L-AUTOMATON, since the transition $(u, x, v)$ is a weak edge, the word $ux$ is not a word in the trie $\mathcal{T}_\infty$. So $ux$ is not a proper prefix of a minimal forbidden word. Then, by Lemma 4.6, $ux$ is not left minimal.

Let us prove that $u$ is right maximal. By contradiction, if $u$ were not right maximal, then by Theorem 3.2 $ux$ would be a right simple extension of $u$, and so $ux \in C^\infty$. This would imply that $ux$ is a word in the trie $\mathcal{T}_\infty$, and then that the transition $(u, x, v)$ is a solid edge, a contradiction.

We now prove that $u$ is double-rooted. Suppose by contradiction that, for an integer $k \ge 0$, one has $D^k(u) = y \in \Sigma$. By Lemma 3.5 then, $D^k(ux) = yy$ or $D^k(ux) = y\overline{y}$. In both cases, since $u$ is left minimal, this would imply that $ux$ is also left minimal. But $ux$ is not a state of $\mathcal{A}_\infty$, since the transition $(u, x, v)$ is a weak edge. So, by Lemma 4.6, $ux$ cannot be a left minimal word.

The word $v$ is left minimal because it is a state of $\mathcal{A}_\infty$ (by Lemma 4.6). Moreover, $v$ is a suffix of $ux$ by Lemma 2.4 and $ux$ is a right minimal word, since $u$ is a right maximal word (Lemma 3.3). So $v$ is also right minimal, and then it is a minimal word.

It remains to prove that $v$ has root equal to 2. Since $u$ has been proved to be double-rooted, there exists $k \ge 0$ such that $D^k(u) = y\overline{y}$, for a $y \in \Sigma$. By Lemma 3.5 we have that $D^k(ux) = y\overline{y}y$ or $D^k(ux) = y\overline{y}\,\overline{y}$. In the first case $ux$ would be a left minimal word (and we proved that this is not possible), so the second case holds. Since $v$ is the longest suffix of $ux$ which is also a left minimal word, it follows that $D^k(v) = \overline{y}\,\overline{y}$, and so $D^{k+1}(v) = 2$. □

We can compact the automaton $\mathcal{A}_\infty$ by using a standard method for compacting automata, described below. We obtain a compacted version of $\mathcal{A}_\infty$, denoted $\mathcal{CA}_\infty$.

Let $u$ be a state of $\mathcal{A}_\infty$ such that $u$ is a right minimal word (and thus $u$ is a minimal word since, by Lemma 4.6, $u$ is also a left minimal word). Let $ux_1x_2\cdots x_n$, $x_i \in \Sigma$, be the right maximal extension of $u$. This means that for every $i$, $1 \le i < n$, the transition $(ux_1x_2\cdots x_i, x_{i+1}, ux_1x_2\cdots x_{i+1})$ is the unique edge outgoing from $ux_1x_2\cdots x_i$ in $\mathcal{A}_\infty$. The procedure for obtaining $\mathcal{CA}_\infty$ from $\mathcal{A}_\infty$ consists in identifying the states belonging to the right maximal extensions of right minimal words. For each right minimal word $u$ in $\mathcal{T}_\infty$, identify all the states of its right maximal extension, and replace the transitions of the right maximal extension of $u$ with a single transition $(u, x_1x_2\cdots x_n, ux_1x_2\cdots x_n)$ labeled with the concatenation of the labels of the transitions in the right maximal extension of $u$. In this way, there are exactly two edges outgoing from each state (either two solid edges or one solid edge and one weak edge).

If in $\mathcal{A}_\infty$ there is a weak edge $(u, x, v)$, and $v'$ is the right maximal extension of $v$, then in $\mathcal{CA}_\infty$ there will
be a weak edge $(u, x, v')$. The label of this weak edge is then set to be the same word labeling the (unique)
solid edge ingoing to $v'$ in $\mathcal{CA}_\infty$.

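A partial diagram of the automaton $\mathcal{CA}_\infty$ is depicted in Figure 3.

The automaton $\mathcal{CA}_\infty$ induces on $C^\infty$ a new equivalence, defined by

$$u \equiv_{\mathcal{CA}} v \iff \delta_{\mathcal{A}}(\varepsilon, u) \text{ and } \delta_{\mathcal{A}}(\varepsilon, v) \text{ belong to the right maximal extension of the same state.}$$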

Proposition 4.11. Let $u, v \in C^\infty$. Then $u \equiv_{\mathcal{CA}} v$ if and only if $u$ and $v$ have the same maximal extension.



Proof. Let $u, v \in C^\infty$ and let $u' = \delta_{\mathcal{A}}(\varepsilon, u)$, $v' = \delta_{\mathcal{A}}(\varepsilon, v)$. Then, by Proposition 4.8, $u$ is a left simple extension of $u'$ and $v$ is a left simple extension of $v'$. On the other hand, by definition of $\mathcal{CA}_\infty$, $u'$ and $v'$ are right simple extensions of the same word $w$, which, by Lemma 4.6, is a left minimal word. Thus, $u$ and $v$ are left simple extensions of right simple extensions of $w$, i.e., they are both factors of the maximal extension of $w$. □

So, each state of $\mathcal{CA}_\infty$ can be identified with a class of words having the same maximal extension. We denote by $[u]_{\mathcal{CA}}$ the class of $u$ with respect to the equivalence $\equiv_{\mathcal{CA}}$. Every class $[u]_{\mathcal{CA}}$ contains a unique shortest element $u$, which is a minimal word. The other elements in $[u]_{\mathcal{CA}}$ are the simple extensions of $u$, to the left and to the right, up to the maximal extension of $u$, which is a maximal word (by Remark 5). By Lemma 3.4, all the words belonging to the same class with respect to the equivalence $\equiv_{\mathcal{CA}}$ have the same height and the same root. Therefore, we can unambiguously define the height and the root of a state in $\mathcal{CA}_\infty$.

Remark 8. For every $k > 0$, there are $2^k$ states of height $k$ in $\mathcal{CA}_\infty$. In particular, there are $2^{k-1}$ single-rooted states and $2^{k-1}$ double-rooted states.

Proposition 4.12. Let $u$ be a state of $\mathcal{CA}_\infty$. If $u$ is single-rooted, then there are two solid edges outgoing
from $u$. If instead $u$ is double-rooted, then there are one solid edge and one weak edge outgoing from $u$.

Proof. The claim follows from the construction of $\mathcal{CA}_\infty$ and from Proposition 4.10. □



5. Vertical representation of $C^\infty$-words

In this section, we introduce a new framework for dealing with $C^\infty$-words. We define a function $\Psi$ for representing a $C^\infty$-word on a three-letter alphabet $\Sigma_0 = \{0, 1, 2\}$. This function is a generalization of the function $\Phi$ considered in [3], which associates to any $C^\infty$-word $w = w[0]w[1]\cdots w[n-1]$ the sequence of the first symbols of the derivatives of $w$, that is, the function defined by $\Phi(w)[i] = D^i(w)[0]$ for $0 \le i < k$, where $k$ is the height of $w$.

If one takes the first and the last symbol of each derivative of a $C^\infty$-word $w$, that is, the pair $\Phi(w)$, $\Phi(\tilde{w})$, one gets a representation of $C^\infty$-words that is not injective. For example, take the two $C^\infty$-words $w = 2211$ and $w' = 21121221$. Then one has $\Phi(w) = \Phi(w') = 222$ and $\Phi(\tilde{w}) = \Phi(\tilde{w}') = 122$. In order to obtain an injective representation, we need an extra symbol. We thus introduce the following definition.

Definition 11. Let $w = w[0]w[1]\cdots w[n-1]$ be a $C^\infty$-word of height $k > 0$. The left frontier of $w$ is the word $\Psi(w) \in \Sigma_0^k$ defined by $\Psi(w)[0] = w[0]$ and, for $0 < i < k$,

$$\Psi(w)[i] = \begin{cases} 0 & \text{if } D^i(w)[0] = 2 \text{ and } D^{i-1}(w)[0] \ne D^{i-1}(w)[1], \\ D^i(w)[0] & \text{otherwise.} \end{cases}$$

For the empty word, we set $\Psi(\varepsilon) = \varepsilon$.

The right frontier of $w$ is defined as $\Psi(\tilde{w})$. If $U$ and $V$ are respectively the left and right frontier of $w$, we call $U|V$ the vertical representation of $w$.

In other words, to obtain the left (resp. the right) frontier of $w$, one has to take the first (resp. the last)
symbol of each derivative of $w$ and replace a 2 by a 0 whenever the primitive above is not left minimal (resp.
is not right minimal).

Example 6. Let $w = 21221211221$. We have:

$D^0(w)$ = 21221211221
$D^1(w)$ = 121122
$D^2(w)$ = 122
$D^3(w)$ = 2

The word $D^2(w) = 122$ is not a left minimal primitive of the word $D^3(w) = 2$, and therefore $\Psi(w)[3]$, the fourth symbol of the left frontier of $w$, is a 0; analogously, the word $w = 21221211221$ is not a right minimal primitive of $D(w) = 121122$, and therefore $\Psi(\tilde{w})[1]$, the second symbol of the right frontier of $w$, is a 0.
Hence, the vertical representation of $w$ is $\Psi(w)|\Psi(\tilde{w}) = 2110|1022$.
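The computation of Example 6 can be reproduced mechanically. The following is a small sketch of Definition 11, assuming $C^\infty$-words are given as Python strings over `1` and `2`; the function names are illustrative only, and the right frontier is obtained by applying the left-frontier routine to the reversed word.

```python
from itertools import groupby


def derivative(w):
    """Return D(w); raise ValueError if w is not differentiable."""
    run_lengths = [len(list(group)) for _, group in groupby(w)]
    if any(n > 2 for n in run_lengths):
        raise ValueError("word is not differentiable: " + w)
    if run_lengths and run_lengths[0] == 1:
        run_lengths = run_lengths[1:]
    if run_lengths and run_lengths[-1] == 1:
        run_lengths = run_lengths[:-1]
    return "".join(str(n) for n in run_lengths)


def derivative_chain(w):
    """Return [D^0(w), D^1(w), ..., D^(k-1)(w)], where k is the height of w."""
    chain = []
    while w:
        chain.append(w)
        w = derivative(w)
    return chain


def left_frontier(w):
    """Left frontier of w: first symbols of the derivatives, with a 2 replaced
    by 0 when the first two symbols of the previous derivative differ."""
    chain = derivative_chain(w)
    if not chain:
        return ""
    symbols = [chain[0][0]]
    for previous, current in zip(chain, chain[1:]):
        if current[0] == "2" and previous[0] != previous[1]:
            symbols.append("0")
        else:
            symbols.append(current[0])
    return "".join(symbols)


def vertical_representation(w):
    """Return the string U|V, with U the left and V the right frontier of w."""
    return left_frontier(w) + "|" + left_frontier(w[::-1])
```

On the two words 2211 and 21121221, which share the same first and last symbols of all derivatives, this sketch returns `222|122` and `200|100` respectively, so the extra symbol 0 is what separates them.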

Remark 9. By definition, for any $C^\infty$-word $w$ of height $k > 0$, we have that $\Psi(w) = \Psi(w)[0]\Psi(w)[1]\cdots\Psi(w)[k-1]$ is a word of length $k$ over $\Sigma_0$ whose first symbol is different from 0. Conversely, any word $U$ of length $k > 0$ over $\Sigma_0$, such that its first symbol is different from 0, is the left frontier of some $C^\infty$-word of height $k$.

Theorem 5.1. Any word in $C^\infty$ is uniquely determined by its vertical representation.

Proof. The claim follows directly from the definition of $\Psi$. □

Remark 10. Let $w$ be a $C^\infty$-word of height $k > 0$ and $U|V$ its vertical representation. Then:

1. $w$ is left (resp. right) maximal if and only if $U[i] \ne 2$ (resp. $V[i] \ne 2$) for every $i = 1, \ldots, k-1$;

2. $w$ is left (resp. right) minimal if and only if $U[i] \ne 0$ (resp. $V[i] \ne 0$) for every $i = 1, \ldots, k-1$.

We shall explore further properties of the vertical representation in a forthcoming paper [12].

The vertical compacted automaton, denoted $\mathcal{VCA}_\infty$, is obtained from $\mathcal{CA}_\infty$ by replacing the label of each
state $u$ by the vertical representation of $u$, and by replacing the labels of the transitions in the following
way: a solid edge from a state $U|V$ to a state $Ux|V'$, $x \in \Sigma$, is labeled by $x$; a solid edge from a state $U|V$
to a state $U|V'$ is labeled by $\varepsilon$; finally, weak edges are labeled by 0.

A partial diagram of the automaton $\mathcal{VCA}_\infty$ is depicted in Figure 4.

The choice of introducing $\varepsilon$-transitions is motivated by the following considerations. By construction, each state of the automaton $\mathcal{CA}_\infty$ corresponds to a class of words having the same height and the same root. There is a unique minimal word $w$ in each state, and all other words in the same state are the simple extensions, on the left and on the right, of $w$. There are two kinds of minimal words: single-rooted and double-rooted. It is easy to see that for every $U = U[0]U[1]\cdots U[k-1] \in \Sigma^k$, $U$ is the left frontier of exactly two minimal words of height $k$: a single-rooted word $w_1$ having root equal to $U[k-1]$ and a double-rooted word $w_2$ having root equal to $U[k-1]\overline{U[k-1]}$. Moreover, $w_1$ is a prefix of $w_2$. Therefore, if we label with $\varepsilon$ each transition from the class of $w_1$ to the class of $w_2$, a path in $\mathcal{VCA}_\infty$ starting at the initial state and labeled by $U$ ends in a state having label $U|V$.

Figure 4: The automaton $\mathcal{VCA}_\infty$ cut at height 3. All states are terminal.

As a consequence, we can further compact the automaton by identifying the pairs of states in $\mathcal{VCA}_\infty$ that have the same left frontier. This corresponds to identifying the class of $w_1$ with the class of $w_2$. In this way, each class of words is uniquely determined by the left frontier of its minimal element only. The resulting automaton, called the ultra-compacted version of $\mathcal{VCA}_\infty$, is denoted $\mathcal{VUCA}_\infty$. The transitions are labeled by the letters of $\Sigma_0$. The letter 0 is the label of the weak edges, while 1 and 2 label solid edges. The trie formed by the solid edges of $\mathcal{VUCA}_\infty$ is a complete binary tree in which there are $2^k$ nodes at level $k$ representing the left frontiers of the minimal words of height $k$. Each state of $\mathcal{VUCA}_\infty$ different from $\varepsilon$ has exactly three outgoing edges: two solid edges, labeled by 1 and 2, and a weak edge labeled by 0.

Actually, cutting the infinite automaton $\mathcal{VUCA}_\infty$ at level $k$, one obtains a deterministic automaton $\mathcal{VUCA}_k = (Q, \Sigma_0, \varepsilon, Q, \delta_{\mathcal{VUCA}})$, where $Q = \Sigma^{\le k}$ and $\delta_{\mathcal{VUCA}}$ is defined by:

1. $\delta_{\mathcal{VUCA}}(U, x) = Ux$ if $x \in \Sigma$;

2. $\delta_{\mathcal{VUCA}}(U, 0) = V2$ if for any $u \in C^\infty$ such that $\Psi(u) = U0$, the longest suffix $v$ of $u$ that is also a left minimal word has left frontier equal to $\Psi(v) = V2$.

A partial diagram of the automaton $\mathcal{VUCA}_\infty$ is depicted in Figure 5. Note that the order of the states at each level (the lexicographic order in the upper half, and the reverse of the lexicographic order in the lower half) makes the graph of the automaton symmetric. This property follows from the symmetry of the vertical representation of $C^\infty$-words with respect to the swap of the first symbol, which, in turn, represents the symmetry of $C^\infty$-words with respect to the complement. Indeed, if a word $w$ has left frontier $U[0]U[1]\cdots U[k-1]$, then the word $\overline{w}$ has left frontier $\overline{U[0]}U[1]\cdots U[k-1]$.

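The automaton $\mathcal{VUCA}_\infty$ induces on the set of $C^\infty$-words a natural equivalence defined by

$$u \equiv_{\mathcal{VUCA}} v \iff \delta_{\mathcal{VUCA}}(\varepsilon, \Psi(u)) = \delta_{\mathcal{VUCA}}(\varepsilon, \Psi(v)).$$

We denote the class of $u$ with respect to this equivalence by $[u]_{\mathcal{VUCA}}$.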

Proposition 5.2. Let $u, v \in C^\infty$. Then $u \equiv_{\mathcal{VUCA}} v$ if and only if the left maximal extension of $u$ and the
left maximal extension of $v$ have the same left frontier.

Proof. The claim is a direct consequence of the construction of $\mathcal{VUCA}_\infty$. □

We end the section by discussing an interesting property of the automaton $\mathcal{VUCA}_\infty$. Let $w$ be a $C^\infty$-word. By Theorem 5.1, $w$ is uniquely determined by its vertical representation $\Psi(w)|\Psi(\tilde{w})$. Moreover, $w$ is a simple extension of a unique minimal word $w'$ having the same height and the same root as $w$ (Lemma 3.4). To get the vertical representation $\Psi(w')|\Psi(\tilde{w}')$ of the word $w'$, one can use the automaton $\mathcal{VUCA}_\infty$. Indeed, $\Psi(w)$ is the label of a unique path in $\mathcal{VUCA}_\infty$ starting at the origin and ending in a state $U$. Then $U$ is the left frontier of $w'$, i.e., $U = \Psi(w')$. Analogously, $\Psi(\tilde{w})$ is the label of a unique path in $\mathcal{VUCA}_\infty$
starting at the origin and ending in a state $V$, and $V$ is the right frontier of $w'$, i.e., $V = \Psi(\tilde{w}')$.
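The end states of these two paths can also be obtained without drawing the automaton. The sketch below relies on an assumption suggested by Proposition 4.8 and by the definition of the weak transitions, but not stated in this form in the text: the state reached by reading the left frontier of $w$ is the left frontier of the longest suffix of $w$ that is left minimal, where left minimality is tested through Remark 10 (no symbol 0 in the frontier). All function names are hypothetical.

```python
from itertools import groupby


def derivative(w):
    """Return D(w); raise ValueError if w is not differentiable."""
    run_lengths = [len(list(group)) for _, group in groupby(w)]
    if any(n > 2 for n in run_lengths):
        raise ValueError("word is not differentiable: " + w)
    if run_lengths and run_lengths[0] == 1:
        run_lengths = run_lengths[1:]
    if run_lengths and run_lengths[-1] == 1:
        run_lengths = run_lengths[:-1]
    return "".join(str(n) for n in run_lengths)


def left_frontier(w):
    """Left frontier of a C-infinity word w, following Definition 11."""
    symbols = []
    previous = None
    while w:
        if previous is not None and w[0] == "2" and previous[0] != previous[1]:
            symbols.append("0")
        else:
            symbols.append(w[0])
        previous, w = w, derivative(w)
    return "".join(symbols)


def is_left_minimal(w):
    """Remark 10: w is left minimal iff its left frontier has no 0."""
    return "0" not in left_frontier(w)


def reached_state(w):
    """Left frontier of the longest left minimal suffix of w (assumed to be
    the state reached by reading the left frontier of w)."""
    for start in range(len(w)):
        suffix = w[start:]
        if is_left_minimal(suffix):
            return left_frontier(suffix)
    return ""


def minimal_word_representation(w):
    """Vertical representation of the minimal word of which w is a simple extension."""
    return reached_state(w) + "|" + reached_state(w[::-1])
```

For $w = 21221211221$ the longest left minimal suffix is 21211221, whose left frontier is 2122, and the sketch returns `2122|2222`; for $w = 1221221121$ it returns `221|122`.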

Example 7. Let $w = 21221211221$ as in Example 6. The vertical representation of $w$ is $2110|1022$. Looking at the graph of the automaton $\mathcal{VUCA}_\infty$ (Figure 5) we see that the path starting at the origin and labeled by 2110 ends in state 2122, while the path starting at the origin and labeled by 1022 ends in state 2222. Thus, the minimal word of which $w$ is a simple extension is the word $w'$ having vertical representation $2122|2222$, that is, the word $w' = 2121122$.

Example 8. Let $w = 1221221121$. The vertical representation of $w$ is $101|110$. Looking at the graph of the automaton $\mathcal{VUCA}_\infty$ (Figure 5) we see that the path starting at the origin and labeled by 101 ends in state 221, while the path starting at the origin and labeled by 110 ends in state 122. Thus, the minimal word of which $w$ is a simple extension is the word $w'$ having vertical representation $221|122$, that is, the word $w' = 2212211$.

Figure 5: The automaton $\mathcal{VUCA}_\infty$ cut at height 4. All states are terminal. The order of the states at each level is the lexicographic order in the upper half, and the reverse of the lexicographic order in the lower half. This makes the graph of the automaton symmetric.

6. $C^\infty$-words of the form $uzu$

In this section, we use the structure of $\mathcal{VUCA}_\infty$ for deriving an upper bound on the length of the gap between two occurrences of a $C^\infty$-word. Recall that a repetition with gap $n$ of the $C^\infty$-word $u$ is a $C^\infty$-word of the form $uzu$ such that $|z| = n$. Carpi [5] proved that for every $n \ge 0$ there are finitely many repetitions with gap $n$ in $C^\infty$. In a more recent paper, Carpi and D'Alonzo [7] proved that the repetitivity index of $C^\infty$-words is ultimately bounded from below by a linear function. The repetitivity index [5] is the integer function $I$ defined by

$$I(n) = \min\{k \ge 0 \mid \exists u \in C^\infty,\ |u| = n : uzu \in C^\infty \text{ for a } z \text{ such that } |z| = k\}.$$

In other words, $I(n)$ gives the minimal gap of a repetition of a word $u$ of length $n$ in $C^\infty$.

We now explore the relationship between the length and the height of a $C^\infty$-word. Writing $|v|_x$ for the number of occurrences of the letter $x$ in $v$, for any $C^\infty$-word $w$ we have

$$|D(w)|_1 + 2|D(w)|_2 \le |w| \le |D(w)|_1 + 2|D(w)|_2 + 2.$$
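The inequality can be checked exhaustively on short words. The sketch below is only a numerical sanity check under stated assumptions: it enumerates all differentiable words up to a hypothetical length bound (a superset of the $C^\infty$-words of that length) and compares $|w|$ with the number of letters accounted for by the runs encoded in $D(w)$; the two extra letters in the upper bound are the first and last runs of length 1 that the derivative may discard.

```python
from itertools import groupby, product


def derivative(w):
    """Return D(w), or None if w is not differentiable (some run has length >= 3)."""
    run_lengths = [len(list(group)) for _, group in groupby(w)]
    if any(n > 2 for n in run_lengths):
        return None
    if run_lengths and run_lengths[0] == 1:
        run_lengths = run_lengths[1:]
    if run_lengths and run_lengths[-1] == 1:
        run_lengths = run_lengths[:-1]
    return "".join(str(n) for n in run_lengths)


def length_bounds_hold(w):
    """Check |D(w)|_1 + 2|D(w)|_2 <= |w| <= |D(w)|_1 + 2|D(w)|_2 + 2."""
    d = derivative(w)
    encoded_letters = d.count("1") + 2 * d.count("2")
    return encoded_letters <= len(w) <= encoded_letters + 2


def all_differentiable_words(max_len):
    """Yield every differentiable word over {1, 2} of length 1..max_len."""
    for n in range(1, max_len + 1):
        for letters in product("12", repeat=n):
            w = "".join(letters)
            if derivative(w) is not None:
                yield w
```

For $w = 21221211221$ one has $D(w) = 121122$, so the runs encoded by $D(w)$ account for $3 + 2 \cdot 3 = 9$ letters, and the two discarded boundary letters bring the total to $|w| = 11$.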

Chvátal [5] proved that the upper density of 2's in a $k$-differentiable word, for $k > 22$, is less than $p = 0.50084$.
Hence, we can suppose that for every $C^\infty$-word $w$ of height $k > 22$ one has

$$(2 - p)|D(w)| < |w| < (1 + p)|D(w)| + 2. \qquad (1)$$

We thus have the following lemma.

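Lemma 6.1. There exist positive constants $\alpha$ and $\beta$ such that for any $C^\infty$-word $w$

$$\alpha (2 - p)^{h(w)} < |w| < \beta (1 + p)^{h(w)}, \qquad (2)$$

and therefore

$$\frac{\log|w| - \log\beta}{\log(1 + p)} < h(w) < \frac{\log|w| - \log\alpha}{\log(2 - p)}, \qquad (3)$$

where $h(w)$ is the height of $w$.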



Theorem 6.2. Let $u \in C^\infty$. Then there exists $z \in C^\infty$ such that $uzu \in C^\infty$ and $|uzu| \le C|u|^{2.72}$, for a suitable
constant $C$.

Proof. Let $u$ be a $C^\infty$-word of height $h(u)$. Without loss of generality, we can suppose that $u$ is a maximal word. Indeed, if a word $w'$ is the maximal extension of a word $w$, then a word of the form $w'z'w'$, $z' \in \Sigma^*$, contains a word of the form $wzw$ as factor, for a $z \in \Sigma^*$. Moreover, $|wzw| \le |w'z'w'|$.

Let $U' = \Psi(u)$ be the left frontier of the word $u$. Then $U'$ is the label of a unique path in $\mathcal{VUCA}_\infty$ starting at the initial state and ending in a state $U$. Consider the paths in $\mathcal{VUCA}_\infty$ outgoing from the state $U$. Since each state of $\mathcal{VUCA}_\infty$ has exactly three outgoing edges, there are $3^n$ distinct paths of length $n$ starting in $U$. Each of these paths ends in a state $W$ such that $|W| = h(u) + n$. Since there are $2^{n+h(u)}$ distinct states $W$ such that $|W| = h(u) + n$, by the pigeonhole principle there will be two distinct paths of length $n$ starting in $U$ and ending in the same state, say $V$, whenever $3^n > 2^{n+h(u)}$, that is, whenever $n > \gamma h(u)$, where $\gamma = (\log_2 3 - 1)^{-1} \approx 1.70951$.

So there exists a state $V$ in $\mathcal{VUCA}_\infty$ such that $|V| \le \lceil (1 + \gamma)h(u) \rceil$ and there are two distinct paths, say $V_1$ and $V_2$, from $U$ to $V$.

Thus, there exist two distinct $C^\infty$-words $v_1$ and $v_2$ such that $\Psi(v_1) = U'V_1$ and $\Psi(v_2) = U'V_2$ (and this implies that $u$ is a prefix of both $v_1$ and $v_2$), and $v_1 \equiv_{\mathcal{VUCA}} v_2$. Moreover, if $v$ is a $C^\infty$-word such that $\Psi(v) = V$, we can suppose that $v_1$ and $v_2$ are two distinct left simple extensions of $v$. Hence, we can suppose that one of the two words (say $v_1$) is a suffix of the other ($v_2$). This implies that $u$ appears as a prefix of $v_2$ and has at least a second occurrence as a (proper) factor in $v_2$.

We suppose that the two occurrences of $u$ in $v_2$ do not overlap. Actually, the set $C^\infty$ does not contain overlaps of length greater than 55 (since every overlap contains two squares), so our assumption consists in discarding a finite number of cases, which can be absorbed into the constant $C$ of the claim.

Thus, we can write $v_2 = uzuv_2'$, for $z, v_2' \in \Sigma^*$, and we have

$$h(uzu) \le h(v_2) \le \lceil (1 + \gamma)h(u) \rceil < (1 + \gamma)h(u) + 1.$$

By Equations (2) and (3) we have

$$|uzu| < \beta (1 + p)^{(1+\gamma)h(u)+1} < \beta (1 + p)(1 + p)^{(1+\gamma)\frac{\log|u| - \log\alpha}{\log(2-p)}} = \frac{\beta(1 + p)}{\alpha^{\gamma'}}\, |u|^{(1+\gamma)\frac{\log(1+p)}{\log(2-p)}} = C|u|^{\gamma'},$$

where $C$ is a constant and $\gamma' = (1 + \gamma)\frac{\log(1+p)}{\log(2-p)} \approx 2.71701$.

□
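The numerical constants appearing in the proof can be recomputed directly. The sketch below assumes only the value $p = 0.50084$ quoted above and the closed forms $\gamma = (\log_2 3 - 1)^{-1}$ and $\gamma' = (1 + \gamma)\log(1+p)/\log(2-p)$; the helper `smallest_collision_length` is a hypothetical name for the least $n$ satisfying the pigeonhole condition $3^n > 2^{n+h}$, computed with exact integer arithmetic.

```python
import math

P = 0.50084  # upper density bound quoted in the text

# Pigeonhole threshold: 3^n > 2^(n+h)  <=>  n > GAMMA * h.
GAMMA = 1 / (math.log2(3) - 1)

# Exponent of |u| in the final bound |uzu| < C |u|^GAMMA_PRIME.
GAMMA_PRIME = (1 + GAMMA) * math.log(1 + P) / math.log(2 - P)


def smallest_collision_length(h):
    """Least n with 3^n > 2^(n+h), i.e. more paths of length n than states."""
    n = 0
    while 3 ** n <= 2 ** (n + h):
        n += 1
    return n
```

For example, for $h = 10$ the least such $n$ is 18, since $3^{18} = 387420489 > 2^{28} = 268435456$ while $3^{17} < 2^{27}$; this agrees with $\gamma \cdot 10 \approx 17.1$. The value of $\gamma'$ is slightly below 2.72, which is the exponent used in the statement of Theorem 6.2.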



In the proof of Theorem 6.2 we do not exhibit the word $uzu$ of the claim. Actually, to obtain such
a word, one has to explore (a finite portion of) the graph of $\mathcal{VUCA}_\infty$. We do not know whether a direct
construction of the word $uzu$ is possible using another approach.

As a direct consequence of Theorem 6.2, we have a sub-cubic upper bound on the length of a repetition
(with gap) of a $C^\infty$-word.

Let us define the function

$$G(n) = \min\{k \mid \forall u \in C^\infty,\ |u| = n,\ \exists z : |z| \le k,\ uzu \in C^\infty\}.$$

The function $G$ is a dual function with respect to the repetitivity index $I(n)$. As a consequence of
Theorem 6.2, we have:



Corollary 6.3. $G(n) = o(n^3)$.



7. Conclusion 

In this paper we exhibited different classifications of $C^\infty$-words based on simple extensions, by means of graphs of infinite automata representing the set of $C^\infty$-words. Our approach makes use of an algorithmic procedure for constructing deterministic automata, but the main interest in using this approach is that it allows us to define a structure (the graph of the infinite automaton) for representing the whole set of $C^\infty$-words.

The vertical representation of $C^\infty$-words introduced in Section 5 leads to a more compact automaton representing $C^\infty$-words, $\mathcal{VUCA}_\infty$, keeping at the same time all the information on the words. Indeed, this novel representation allows one to manipulate $C^\infty$-words without requiring detailed knowledge of the particular sequence of 1s and 2s appearing (or not) in them. In a forthcoming paper we will discuss in more depth the properties of the vertical representation of $C^\infty$-words [12].



In Theorem 6.2 we gave an upper bound on the length of a repetition with gap of a $C^\infty$-word. It is a
dual result with respect to the lower bounds obtained by Carpi [5, 7]. Numerical experiments suggest that
a tighter bound on the gap of a repetition of a $C^\infty$-word $u$ could be sub-quadratic in the length of $u$.



The proof of Theorem 6.2 does not allow one to build a repetition of a $C^\infty$-word directly, that is, without using the graph of the automaton $\mathcal{VUCA}_\infty$. However, this is a consequence of the particular approach we used. In fact, most of the known results about the existence of particular patterns in $C^\infty$-words make use of standard methods in combinatorics on words, while we think that the techniques we developed in this paper represent a novel approach to the study of $C^\infty$-words. We hope that this will stimulate further developments, eventually leading to the solution of Problem 1 and, perhaps, to a proof of at least some of the longstanding conjectures on the Kolakoski word.